## here() starts at /home/alex/3ieo/ieoproject-stubibi

1 Introduction

In autoimmune type 1 diabetes (T1D), the gradual decline of functional β-cell mass initiates the onset of hyperglycemia and necessitates lifelong insulin therapy. However, early-stage dysregulation of glucagon secretion poses a significant challenge to effective disease management. This dysregulation is characterized by inadequate suppression during hyperglycemia and insufficient secretion during hypoglycemia. The specific mechanisms underlying this dysregulated glucagon secretion in T1D remain unclear. However, it is needed to highlight that both the parasympathetic and sympathetic nervous systems play crucial roles in regulating islet hormone secretion to maintain glycemic balance.

Laser microdissection of intact human islets followed by islet RNA-seq offers an initial step towards understanding sympathetic nervous system-related gene pathways without the inherent stress associated with islet and single-cell isolation procedures. All islets utilized in the study of Campbell-Thompson et al. (2021) are from donors without diabetes (ND) and donors with auto-antibodies (AAb) contained β-cells, whereas only a few residual INS-positive β-cells were found in rare islets from T1D donors. Hence, this model was also employed to investigate how the loss of paracrine insulin secretion could influence α-cell gene expression.

The resulting raw RNA-seq data have been deposited at the Gene Expression Omnibus (GEO), where they are publicly available under accession [GSE181674] (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE181674).

Here, we show a first exploratory analysis of the corresponding RNA-seq gene expression profiles generated as a table of counts using the DEE2 (https://dee2.io) pipeline by Ziemann, Kaspi, and El-Osta (2019), and further packaged into a [SummarizedExperiment] (https://bioconductor.org/packages/SummarizedExperiment) object with genes mapped to Entrez identifiers. This object also stores the phenotypic information about the profiled samples that has been also made available at GEO.

The question of interest to address in this work is whether gene expression differs in the islets from donors without diabetes (ND), donors with auto-antibodies (AAb), and donors with type 1 diabetes (T1D).

2 Quality assessment

2.1 Data import and cleaning

We start importing the raw table of counts from the GSE181674.rds file.

library(SummarizedExperiment)

se <- readRDS(file.path(system.file("extdata", package="IEOproject"), "GSE181674.rds"))
se
class: RangedSummarizedExperiment 
dim: 25118 18 
metadata(4): experimentData annotation ensemblVersion urlProcessedData
assays(1): counts
rownames(25118): 1 10 ... 9994 9997
rowData names(5): gene_id gene_biotype description gene_id_version
  symbol
colnames(18): SRR15372438 SRR15372439 ... SRR15372454 SRR15372455
colData names(39): title geo_accession ... disease state:ch1 tissue:ch1

We have 25118 genes by 18 samples. From the first row and column names shown by the object, we can figure out that genes are defined by Entrez (Maglott et al. 2010) identifiers and samples by Sequence Read Archive Run (SRR) identifiers.

First, we explore our SummarizedExperiment object.

The rowData has information about the features profiled in the assay.

head(rowData(se))
DataFrame with 6 rows and 5 columns
                  gene_id   gene_biotype            description
              <character>    <character>            <character>
1         ENSG00000121410 protein_coding alpha-1-B glycoprote..
10        ENSG00000156006 protein_coding N-acetyltransferase ..
100       ENSG00000196839 protein_coding adenosine deaminase ..
1000      ENSG00000170558 protein_coding cadherin 2 [Source:H..
10000     ENSG00000117020 protein_coding AKT serine/threonine..
100008586 ENSG00000236362 protein_coding G antigen 12F [Sourc..
             gene_id_version      symbol
                 <character> <character>
1         ENSG00000121410.11        A1BG
10         ENSG00000156006.4        NAT2
100       ENSG00000196839.12         ADA
1000       ENSG00000170558.8        CDH2
10000     ENSG00000117020.16        AKT3
100008586  ENSG00000236362.8     GAGE12F

colnames(rowData(se))
[1] "gene_id"         "gene_biotype"    "description"     "gene_id_version"
[5] "symbol"         

dim(rowData(se))
[1] 25118     5

We can see information about the genes: gene_id (ensembl id), gene_biotype, description, gene_id_version and symbol. Among these, the gene symbol and description are potentially useful for interpreting results of, for instance, a differential expression analysis.

The colData contains phenotypic information about the samples.

head(colData(se), n=3)
DataFrame with 3 rows and 39 columns
                  title geo_accession                status submission_date
            <character>   <character>           <character>     <character>
SRR15372438    AAb_6267    GSM5509128 Public on Aug 09 2021     Aug 09 2021
SRR15372439    AAb_6424    GSM5509129 Public on Aug 09 2021     Aug 09 2021
SRR15372440    AAb_6429    GSM5509130 Public on Aug 09 2021     Aug 09 2021
            last_update_date        type channel_count        source_name_ch1
                 <character> <character>   <character>            <character>
SRR15372438      Aug 10 2021         SRA             1 laser microdissected..
SRR15372439      Aug 10 2021         SRA             1 laser microdissected..
SRR15372440      Aug 10 2021         SRA             1 laser microdissected..
            organism_ch1    characteristics_ch1  characteristics_ch1.1
             <character>            <character>            <character>
SRR15372438 Homo sapiens tissue: laser microd.. disease state: non-d..
SRR15372439 Homo sapiens tissue: laser microd.. disease state: non-d..
SRR15372440 Homo sapiens tissue: laser microd.. disease state: non-d..
            characteristics_ch1.2 molecule_ch1   extract_protocol_ch1
                      <character>  <character>            <character>
SRR15372438             biorep: 1    polyA RNA PicoPure RNA extract..
SRR15372439             biorep: 2    polyA RNA PicoPure RNA extract..
SRR15372440             biorep: 3    polyA RNA PicoPure RNA extract..
            extract_protocol_ch1.1 extract_protocol_ch1.2   taxid_ch1
                       <character>            <character> <character>
SRR15372438 Agilent 2100 Bioanal.. A sample of 6 ng RNA..        9606
SRR15372439 Agilent 2100 Bioanal.. A sample of 6 ng RNA..        9606
SRR15372440 Agilent 2100 Bioanal.. A sample of 6 ng RNA..        9606
                   data_processing      data_processing.1
                       <character>            <character>
SRR15372438 Quality control perf.. Reads mapped to huma..
SRR15372439 Quality control perf.. Reads mapped to huma..
SRR15372440 Quality control perf.. Reads mapped to huma..
                 data_processing.2      data_processing.3
                       <character>            <character>
SRR15372438 Gene expression quan.. A multi-dimensional ..
SRR15372439 Gene expression quan.. A multi-dimensional ..
SRR15372440 Gene expression quan.. A multi-dimensional ..
                 data_processing.4      data_processing.5  data_processing.6
                       <character>            <character>        <character>
SRR15372438 Empirical Bayes tech.. Network analysis and.. Genome_build: hg38
SRR15372439 Empirical Bayes tech.. Network analysis and.. Genome_build: hg38
SRR15372440 Empirical Bayes tech.. Network analysis and.. Genome_build: hg38
                 data_processing.7      data_processing.8
                       <character>            <character>
SRR15372438 Supplementary_files_.. Supplementary_files_..
SRR15372439 Supplementary_files_.. Supplementary_files_..
SRR15372440 Supplementary_files_.. Supplementary_files_..
                 data_processing.9 platform_id data_row_count
                       <character> <character>    <character>
SRR15372438 Supplementary_files_..    GPL21290              0
SRR15372439 Supplementary_files_..    GPL21290              0
SRR15372440 Supplementary_files_..    GPL21290              0
               instrument_model library_selection library_source
                    <character>       <character>    <character>
SRR15372438 Illumina HiSeq 3000              cDNA transcriptomic
SRR15372439 Illumina HiSeq 3000              cDNA transcriptomic
SRR15372440 Illumina HiSeq 3000              cDNA transcriptomic
            library_strategy               relation             relation.1
                 <character>            <character>            <character>
SRR15372438          RNA-Seq BioSample: https://w.. SRA: https://www.ncb..
SRR15372439          RNA-Seq BioSample: https://w.. SRA: https://www.ncb..
SRR15372440          RNA-Seq BioSample: https://w.. SRA: https://www.ncb..
            supplementary_file_1  biorep:ch1      disease state:ch1
                     <character> <character>            <character>
SRR15372438                 NONE           1 non-diabetic islet a..
SRR15372439                 NONE           2 non-diabetic islet a..
SRR15372440                 NONE           3 non-diabetic islet a..
                        tissue:ch1
                       <character>
SRR15372438 laser microdissected..
SRR15372439 laser microdissected..
SRR15372440 laser microdissected..

colnames(colData(se))
 [1] "title"                  "geo_accession"          "status"                
 [4] "submission_date"        "last_update_date"       "type"                  
 [7] "channel_count"          "source_name_ch1"        "organism_ch1"          
[10] "characteristics_ch1"    "characteristics_ch1.1"  "characteristics_ch1.2" 
[13] "molecule_ch1"           "extract_protocol_ch1"   "extract_protocol_ch1.1"
[16] "extract_protocol_ch1.2" "taxid_ch1"              "data_processing"       
[19] "data_processing.1"      "data_processing.2"      "data_processing.3"     
[22] "data_processing.4"      "data_processing.5"      "data_processing.6"     
[25] "data_processing.7"      "data_processing.8"      "data_processing.9"     
[28] "platform_id"            "data_row_count"         "instrument_model"      
[31] "library_selection"      "library_source"         "library_strategy"      
[34] "relation"               "relation.1"             "supplementary_file_1"  
[37] "biorep:ch1"             "disease state:ch1"      "tissue:ch1"            

dim(colData(se))
[1] 18 39

We have a total of 39 phenotypic variables. The second column geo_accession contains GEO Sample Accession Number (GSM) identifiers. GSM identifiers define individual samples, understood in our context as individual sources of RNA. Our samples are not repeated, indicating that among the 18 samples we have not technical replicates:

ncol(se)
[1] 18
length(unique(se$geo_accession))
[1] 18
table(lengths(split(colnames(se), se$geo_accession)))

 1 
18 

So, we have 18 different individual samples out of 18 total samples so we have not technical replicates.

To proceed further exploring this dataset, we are going to use the edgeR package and build a DGEList object, incorporating the gene metadata, which includes the gene symbol. We have to be sure that the dimensions of the DGEList object and the SummarizedExperiment object are the same.

library(edgeR)
dge <- DGEList(counts=assays(se)$counts, genes=rowData(se), samples=colData(se))
dim(dge)
[1] 25118    18
dim(dge)==dim(se)
[1] TRUE TRUE

Now, we calculate \(\log_2\) CPM units of expression and put them as an additional assay element to ease their manipulation.

assays(se)$logCPM <- cpm(dge, log=TRUE, prior.count=0.25)
assays(se)$logCPM[1:3, 1:6] #Visualization
    SRR15372438 SRR15372439 SRR15372440 SRR15372441 SRR15372442 SRR15372443
1     -1.175019   -1.482126   -1.735242   -7.356596  -7.3565962   -2.893128
10    -7.356596   -7.356596   -7.356596   -7.356596  -7.3565962   -7.356596
100    2.050547    2.088666    1.659402    1.630366   0.3887662    1.389090

Let’s explore now some of the phenotypic variables.

Some of them provide information about the samples: title (ID for each sample), geo_accesion, type (SRA), relation, characteristics.

We have also information about the study such as status, submission_data, last_update_date.

Then we can see features related to the organism (Homo sapiens) in organism_ch1 and taxid_ch1 and to the tissue (laser microdissected islets) in tissue in which the analysis has been conducted. These features remain the same across all samples studied.

table(se$organism_ch1)

Homo sapiens 
          18 

table(se$`tissue:ch1`)

laser microdissected islets 
                         18 

We can also identify some variables associated with technical factors, such as the extraction protocol (extract_protocol_ch1.X, molecule_ch1), RNA-seq experiment (platform_id, instrument_model, library_selection, library_source, library_strategy etc.) and data processing (data_processing.X) that remain the same across samples.

table(se$extract_protocol_ch1)

PicoPure RNA extraction kit (Thermofisher Scientific, cat# 12204-01) with genomic DNA removal using recombinant DNAse1 (Worthington Biochemical Corporation, cat# LS006361) 
                                                                                                                                                                         18 

table(se$data_processing.1)

Reads mapped to human genome reference using Bowtie2. 
                                                   18 

Finally one feature we are particularly interested in is the disease state. We can observe that for each category of the disease state (disease state:ch1), there are several biological replicates (biorep:ch1):

  • Non-diabetic individuals with islet auto-antibodies: 3 samples
  • Non-diabetic controls: 8 samples
  • Diabetic individuals: 7 samples
table(se$`disease state:ch1`)

                                control non-diabetic 
                                                   8 
                                            diabetic 
                                                   7 
non-diabetic islet autoantibody-positive individuals 
                                                   3 

To facilitate handling the variables disease state:ch1 and biorep:ch1 we are going to recode them as follows.

se$disease_state <- as.factor(se$`disease state:ch1`)
levels(se$disease_state) <- c("ND", "T1D", "AAb")

dge$samples$disease_state <- as.factor(dge$samples$disease.state.ch1)
levels(dge$samples$disease_state) <- c("ND", "T1D", "AAb")

se$bioreplicate <- se$`biorep:ch1`

#Visualization
se$disease_state
 [1] AAb AAb AAb ND  ND  ND  ND  ND  ND  ND  ND  T1D T1D T1D T1D T1D T1D T1D
Levels: ND T1D AAb
dge$samples$disease_state
 [1] AAb AAb AAb ND  ND  ND  ND  ND  ND  ND  ND  T1D T1D T1D T1D T1D T1D T1D
Levels: ND T1D AAb

In Table 1 below, we show this variable jointly with the sample ID and the sample title (identifier used in the paper) and bioreplicate to try to gather as much understanding as possible on the underlying experimental design.

Table 1: Phenotypic variables
Identifer Sample Title Disease State Bioreplicate
SRR15372438 AAb_6267 non-diabetic islet autoantibody-positive individuals 1
SRR15372439 AAb_6424 non-diabetic islet autoantibody-positive individuals 2
SRR15372440 AAb_6429 non-diabetic islet autoantibody-positive individuals 3
SRR15372441 ND_6131 control non-diabetic 1
SRR15372442 ND_6232 control non-diabetic 2
SRR15372443 ND_6234 control non-diabetic 3
SRR15372444 ND_6279 control non-diabetic 4
SRR15372445 ND_6293 control non-diabetic 5
SRR15372446 ND_6336 control non-diabetic 6
SRR15372447 ND_6339 control non-diabetic 7
SRR15372448 ND_6401 control non-diabetic 8
SRR15372449 T1D_6209 diabetic 1
SRR15372450 T1D_6228 diabetic 2
SRR15372451 T1D_6306 diabetic 3
SRR15372452 T1D_6342 diabetic 4
SRR15372453 T1D_6362 diabetic 5
SRR15372454 T1D_6371 diabetic 6
SRR15372455 T1D_6414 diabetic 7

When comparing three main groups for differential gene expression, our objective is to assess the differences in gene expression patterns between each pair of groups. This involves pairwise comparisons between the three groups. In Table 2 we can see how many bioreplicates we have for each condition.

Table 2: Frequency table for disease state
Disease state Frequency
ND 8
T1D 7
AAb 3

2.2 Sequencing depth

Let’s examine the sequencing depth in terms of total number of sequence read counts mapped to the genome per sample. Figure 1 below shows the sequencing depth per sample, also known as library sizes, in increasing order.

Library sizes in increasing order.

Figure 1: Library sizes in increasing order

We don’t see substantial differences in sequencing depth. The sequencing depth ranges from 35 to 48 million reads. Lower sequencing depth samples seem to be enriched in diabetic disease state whereas the other sample groups are not confounded with sequencing depth. Despite this slight difference, there are diabetic samples with higher sequencing depths so the distribution of the different disease states seem to be more or less random indicating no confusion with sequencing depth.

2.3 Distribution of expression levels among samples

Figures 2 and 3 below show the distribution of expression values per sample in logarithmic CPM units of expression.

Non-parametric density distribution of expression profiles per sample.

Figure 2: Non-parametric density distribution of expression profiles per sample

Boxplot of expression profiles per sample

Figure 3: Boxplot of expression profiles per sample

We can see in Figure 2 the characteristic two peaks, one around negative values of \(\log_2\) CPM representing those genes that are not expressed, and the other around positive values representing those genes that are expressed. There are no substantial differences between the samples in the distribution of expression values. The samples show big overlap.

On the other hand, when depicting the boxplot (Figure 3) to assess differential distribution or variability, only one of the samples (T1D_6209) appears to exhibit a slightly different mean value of \(\log_2\) CPM, as depicted in Figure 3. However, this observation is not particularly noteworthy.

2.4 Distribution of expression levels among genes

Let’s calculate now the average expression per gene through all the samples. Figure 4 shows the distribution of those values across genes.

Distribution of average expression level per gene.

Figure 4: Distribution of average expression level per gene

As expected, we have two modes, one for genes that are lowly expressed and another for genes which are more expressed in the samples. We will consider that the lowly expressed genes are the ones below 1 \(\log_2\) CPM levels (red line in Figure 4).

2.5 Filtering of lowly-expressed genes

We filter lowly-expressed genes using the function filterByExpr(), grouping by sample-group to define the minimum number of samples in which a gene should be expressed.

mask <- filterByExpr(dge, group=se$disease_state)
se.filt <- se[mask, ]
dim(se.filt)
[1] 17415    18
dge.filt <- dge[mask, ]
dim(dge.filt)
[1] 17415    18
Distribution of average expression level per gene with filtered genes highlighted in red 1

Figure 5: Distribution of average expression level per gene with filtered genes highlighted in red 1

We are left with 17415 genes in comparison to the 25118 genes we had at the beginning. We can visualize this in Figure 5, since the genes that are left after filtering lowly-expressed genes are highlighted in red.

With this initial filtering method, we fail to eliminate all genes with low expression levels (Figure 5). Many genes exhibit negative \(\log_2\) CPM levels. However, ultimately, we opted for a filtering cutoff requiring a minimum average of 1 \(\log_2\) CPM unit.

mask <- rowMeans(assays(se)$logCPM) > 1
se.filt <- se[mask, ]
dge.filt <- dge[mask, ]
dim(dge.filt)
[1] 12185    18
par(mar=c(4, 5, 1, 1))
Distribution of average expression level per gene with filtered genes highlighted in red 2

Figure 6: Distribution of average expression level per gene with filtered genes highlighted in red 2

We are left with 12185 genes in comparison to the 25118 genes we had at the beginning (Figure 6). Note that after filtering with a minimum average of 1, we end up with less genes (12185) in comparison to the first method in which we define a minimum number of samples in which a gene should be expressed (17415). Both dgeList object and se object have been filtered.

2.6 Normalization between samples

We calculate now the normalization factors on the filtered expression data set. This is an application of the Trimmed Mean of M-values (TMM) method, that takes into consideration different RNA composition of the samples by estimating a scaling factor for each library.

dge.filt <- calcNormFactors(dge.filt)

Now we replace the raw \(\log_2\) CPM units in the corresponding assay element of the SummarizedExperiment object, by the normalized ones.

assays(se.filt)$logCPM <- cpm(dge.filt, log=TRUE, normalized.lib.sizes=TRUE)

2.7 MA-plots

2.7.1 MA-plots between conditions

We will examine now the MA-plots of the expression profiles before and after normalization in Figure 7.

MA-plots of expression values before and after filtering and normalisation.

Figure 7: MA-plots of expression values before and after filtering and normalisation

After filtering out genes with low expression levels and performing between-sample normalization, we eliminate sample-specific effects and biases in differential gene expression caused by low expression. As a result, the proportion of reads assigned to a particular gene in a sample is no longer influenced by the overall expression properties of the entire sample. Instead, it depends solely on the expression level of that gene. We discard counts at low values because ratios between small numbers can lead to inflated fold-changes, giving rise to the particular vuvuzela shape that we can see more clearly in the plots obtained before normalization. In general, fold-changes derived from high expression values are more trustworthy than those derived from low-expression values. As shown in the MA-plots of Figure 7 after normalization, the vuvuzela shape is reduced.

2.7.2 MA-plots for each sample:

We will examine now the MA-plots of each sample compared to the average of the rest of the samples in order to detect any possible problematic samples.

MA-plots of filtered and normalized expression values for each sample AAb vs T1D.

Figure 8: MA-plots of filtered and normalized expression values for each sample AAb vs T1D

We can see in Figure 8 that a number of samples display some expression-level dependent bias. By visualizing the deviation of individual samples from the average expression across all samples, we can detect specific expression patterns in each individual and make conclusions.

Most of the samples exhibit a slight expression-level dependent bias at the low-end and/or high-end of the expression spectrum. However, this bias appears to be more pronounced in samples T1D_6362 and T1D_6371, prompting us to monitor other samples with similar biases for any additional unexpected features. Should such features arise, we may consider their removal from the analysis. Initially, we refrain from eliminating these samples due to the experiment’s limited sample size and the satisfactory quality of other quality control measures.

In instances where this bias manifests at the low end of the expression spectrum, one potential solution could involve implementing a more stringent filter on minimum expression levels. However, upon testing with higher thresholds, we observe a persistent effect and risk discarding numerous genes that could prove valuable for subsequent comparisons.

Furthermore, we note the absence of discernible patterns between samples of the same category.

2.8 Experimental design and batch identification

Here we try to understand the underlying experimental design.

First of all, we can do an a multi-dimensional scaling (MDS) plot between the disease state to see if there is a grouping between samples and to detect possible outliers. As we can see in Figure 9, samples do not show a very good grouping by disease state. The MDS plot suggests that AAb gene expression values appear to lie largely intermediate to those of both ND and T1D that are more separated between them. We can detect for one outlier of a T1D donor. We elected not to exclude this donor based on the additional QC parameters which demonstrated that expression values were highly consistent across the range of expression for all samples as mentioned by the authors of the paper.

Multidimensional scaling plot (MDS) and hierarchical clustering by gene expression of the samples. Labels correspond to DiseaseState_sampleID and colors indicate disease state.

Figure 9: Multidimensional scaling plot (MDS) and hierarchical clustering by gene expression of the samples
Labels correspond to DiseaseState_sampleID and colors indicate disease state.

In the supplementary data, additional phenotypic information about the samples is available, which is not included in the SummarizedExperiment object. To explore the possibility of other sample groupings or potential confounding variables, we intend to augment the colData dataframe of the object with this phenotypic data. Among these variables, we have identified potential confounding factors such as sex, cause of death, BMI, and age. While race could also be considered a potential confounder, the majority of donors in the study were Caucasian, limiting its utility in this context. Since age and BMI are continuous variables, we have decided to define ranges of values for a clearer analysis.


sex <- c("female","male","male","male","female","female","male","female","female","male","female","female","male","male","female","male","female","male")
cause_death <- c("Anoxia", "Head Trauma", "Head Trauma", "Anoxia", "Head Trauma", "Head Trauma", "Head Trauma", "Anoxia", "Head Trauma", "Head Trauma", "Head Trauma", "Cerebral Edema", "Anoxia", "Head Trauma", "Anoxia", "Head Trauma", "Cerebral Edema", "Anoxia")
age <- c(23.0, 17.65, 22.10, 24.2, 14.0, 20.0, 19.0, 9, 14.3, 23.3, 25.07, 5.0, 13.0, 19.0, 14.0, 24.9, 12.5, 23.1)
bmi <- c(18.60, 20.80, 28.90, 34.00, 25.60, 25.00, 24.80, 31.30, 51.40, 19.60, 
         23.50, 15.90, 16.60, 17.40, 24.30, 24.50, 28.40, 28.50)

colnames(dge.filt)<-dge.filt$samples$title

colData(se.filt)$sex <- as.factor(sex)
colData(se.filt)$cause_death <- as.factor(cause_death)

colData(se.filt)$age <- as.numeric(age)
colData(se.filt)$bmi <- as.numeric(bmi)

dge.filt$samples$sex <- as.factor(sex)
dge.filt$samples$cause_death <- as.factor(cause_death)
dge.filt$samples$age <- as.numeric(age)
dge.filt$samples$bmi <- as.numeric(bmi)
Multidimensional scaling plot (MDS) of the samples. Labels correspond to DiseaseState_sampleID and colors indicate the different potential confounding variables

Figure 10: Multidimensional scaling plot (MDS) of the samples
Labels correspond to DiseaseState_sampleID and colors indicate the different potential confounding variables

From Multidimensional Scaling Plots (MDS) of figure 9 we can conclude that with respect to sex, death cause, BMI and age we cannot find any substantial differences or grouping behavior between samples. This leads us to the conclusion that there is no confounding due to these features.

Hierarchical clustering of the samples. Labels correspond to DiseaseState_sampleID and colors indicate the different potential confounding variables.

Figure 11: Hierarchical clustering of the samples
Labels correspond to DiseaseState_sampleID and colors indicate the different potential confounding variables.

According to the previous results, in Figure 11 depicting the hierarchical clusters of the samples colored by the different variables studied, we observe that there is no clustering based on sex, death cause or BMI. However, we have noticed some kind of grouping based on age as the younger samples are on the left and the older on the right.

2.8.1 MDS AND HC CONCLUSION

We examined how samples group together by multidimensional scaling and hierarchical clustering, annotating disease state and sex, BMI, age and cause of death. We calculate again log CPM values with a high prior count(3) to moderate extreme fold-changes produced by low counts.

In the MDS plots shown in Figure 10, a conspicuous clustering pattern is observed, characterized by a predominant cluster encompassing the majority of samples, followed by 3 outliers positioned distinctly apart from the main cluster. Particularly noteworthy is the sample labeled T1D_6109, corresponding to the youngest individual in the dataset, aged 5 years, as indicated in the MDS plot by age. Notably, there is no apparent clustering pattern based on any examined variable. The presence of this distinct outlier, specifically the sample from the 5-year-old child, suggests a significant deviation from the overall expression profiles observed in the dataset, such deviation warrant further investigation to elucidate underlying factors contributing to the observed separation, potentially shedding light on age-related expression patterns or technical artifacts influencing the analysis.

In the hierarchical clustering analysis annotated by sex (see Figure 11), no distinct clustering pattern emerges based on this variable. This observation suggests that the samples do not segregate into discernible groups based on gender. On the other hand, in the hierarchical clustering analysis annotated by BMI (see Figure 11), while there is clustering observed, it is not prominently delineated by this variable alone. Notably, the samples primarily cluster based on BMI, with a notable concentration of underweight samples within a distinct group. However, other BMI categories do not exhibit clear, separate clusters, suggesting that additional factors may contribute to the overall clustering pattern observed.

In the hierarchical clustering analysis annotated by age (see Figure 11), a discernible grouping of samples is evident. This pronounced clustering pattern suggests that age could indeed be a confounding variable in the dataset analysis. The presence of clearly delineated age-based clusters implies that age-related biological or developmental differences may significantly influence the expression profiles observed in the samples. Consequently, failing to account for age as a confounding variables in the analysis could potentially introduce bias or confound the interpretation of results. Therefore, while age should be carefully considered and appropriately addressed as a potential confounding variables, additional validation and exploration are warranted to confirm its true impact on the clustering patterns observed.

2.9 Assesment of expression changes

We perform a simple assessment of the extent of expression changes and their associated p-values using the F-test implemented in the R/Bioconductor package sva. We compare T1D with AAb, T1D with ND and ND with AAb samples. We first subset the data as follows:

# Subset for T1D vs AAb
#T1D vs AAb
se.filt_T1D_AAb <- se.filt[, se.filt$disease_state %in% c("T1D", "AAb")]
se.filt_T1D_AAb$disease_state <- droplevels(se.filt_T1D_AAb$disease_state)

#AAb vs ND
se.filt_AAb_ND <- se.filt[, se.filt$disease_state %in% c("ND", "AAb")]
se.filt_AAb_ND$disease_state <- droplevels(se.filt_AAb_ND$disease_state)

#ND vs T1D
se.filt_ND_T1D <- se.filt[, se.filt$disease_state %in% c("T1D", "ND")]
se.filt_ND_T1D$disease_state <- droplevels(se.filt_ND_T1D$disease_state)

In the second step above, we dropped the unused levels from the disease_state factor variable. This is important to avoid using factor levels that do not exist in this subset of the data. We build now the corresponding full and null model matrices.

2.9.1 Without adjusting for confounding variables:

# Subset for T1D vs AAb
mod_T1D_AAb <- model.matrix(~ disease_state, colData(se.filt_T1D_AAb))
mod0_T1D_AAb <- model.matrix(~ 1, colData(se.filt_T1D_AAb))

# Subset for AAb vs ND
mod_AAb_ND <- model.matrix(~ disease_state, colData(se.filt_AAb_ND))
mod0_AAb_ND <- model.matrix(~ 1, colData(se.filt_AAb_ND))

# Subset for ND vs T1D
mod_ND_T1D <- model.matrix(~ disease_state, colData(se.filt_ND_T1D))
mod0_ND_T1D <- model.matrix(~ 1, colData(se.filt_ND_T1D))

Finally, we conduct the F-test implemented in the package sva and examine the amount of differential expression between pairwise disease_states.

library(sva)

# T1D vs AAb
pv_T1D_AAb <- f.pvalue(assays(se.filt_T1D_AAb)$logCPM, mod_T1D_AAb, mod0_T1D_AAb)
sum(p.adjust(pv_T1D_AAb, method="fdr") < 0.05)
[1] 0
sum(p.adjust(pv_T1D_AAb, method="fdr") < 0.1)
[1] 0

# AAb vs ND
pv_AAb_ND <- f.pvalue(assays(se.filt_AAb_ND)$logCPM, mod_AAb_ND, mod0_AAb_ND)
sum(p.adjust(pv_AAb_ND, method="fdr") < 0.05)
[1] 0
sum(p.adjust(pv_AAb_ND, method="fdr") < 0.1)
[1] 0

# ND vs T1D
pv_ND_T1D <- f.pvalue(assays(se.filt_ND_T1D)$logCPM, mod_ND_T1D, mod0_ND_T1D)
sum(p.adjust(pv_ND_T1D, method="fdr") < 0.05)
[1] 230
sum(p.adjust(pv_ND_T1D, method="fdr") < 0.1)
[1] 841

No DE genes were found between T1D and AAb or ND and AAB. For the comparison between ND and T1D, we identified 230 DE genes at FDR < 5%, but 841 DE genes at FDR < 10%. The distribution of the resulting p-values is shown in Figure 12.

Distribution of raw p-values for an F-test on every gene between T1D vs AAb, AAb vs ND and ND vs T1D samples.

Figure 12: Distribution of raw p-values for an F-test on every gene between T1D vs AAb, AAb vs ND and ND vs T1D samples

We can see in Figure 12 the distribution of p-values of the comparisons. Under the null hypothesis should be uniform. Assuming that most genes are not differentially expressed (DE), the histogram should have looked uniform with a peak at low p-values for the truly DE genes. Departures from this shape indicate within sample group heterogeneity that may need to be adjusted.

2.9.2 Adjusting for known and unknown confounding variables:

Now we adjust for age as known confounding effect and we adjust for unknown sources of variation with SVA. Then we conduct again the F-test implemented in the package sva and examine the amount of differential expression between pairwise disease_states.

# Subset for T1D vs AAb
mod_T1D_AAb <- model.matrix(~ disease_state + age, colData(se.filt_T1D_AAb))
mod0_T1D_AAb <- model.matrix(~ age, colData(se.filt_T1D_AAb))

# Subset for AAb vs ND
mod_AAb_ND <- model.matrix(~ disease_state + age, colData(se.filt_AAb_ND))
mod0_AAb_ND <- model.matrix(~ age, colData(se.filt_AAb_ND))

# Subset for ND vs T1D
mod_ND_T1D <- model.matrix(~ disease_state + age, colData(se.filt_ND_T1D))
mod0_ND_T1D <- model.matrix(~ age, colData(se.filt_ND_T1D))
library(sva)
sv1 <- sva(assays(se.filt_T1D_AAb)$logCPM, mod_T1D_AAb, mod0_T1D_AAb)
Number of significant surrogate variables is:  2 
Iteration (out of 5 ):1  2  3  4  5  
names(sv1)
[1] "sv"        "pprob.gam" "pprob.b"   "n.sv"     

sv2 <- sva(assays(se.filt_AAb_ND)$logCPM, mod_AAb_ND, mod0_AAb_ND)
Number of significant surrogate variables is:  2 
Iteration (out of 5 ):1  2  3  4  5  
names(sv2)
[1] "sv"        "pprob.gam" "pprob.b"   "n.sv"     

sv3 <- sva(assays(se.filt_ND_T1D)$logCPM, mod_ND_T1D, mod0_ND_T1D)
Number of significant surrogate variables is:  3 
Iteration (out of 5 ):1  2  3  4  5  
names(sv3)
[1] "sv"        "pprob.gam" "pprob.b"   "n.sv"     

#For SV1
modSV1 <- cbind(mod_T1D_AAb, sv1$sv)
mod0SV1 <- cbind(mod0_T1D_AAb, sv1$sv)
pvalsSV1 <- f.pvalue(assays(se.filt_T1D_AAb)$logCPM, modSV1, mod0SV1)
sum(p.adjust(pvalsSV1, method="fdr") < 0.05)
[1] 0

#For SV2
modSV2 <- cbind(mod_AAb_ND, sv2$sv)
mod0SV2 <- cbind(mod0_AAb_ND, sv2$sv)
pvalsSV2 <- f.pvalue(assays(se.filt_AAb_ND)$logCPM, modSV2, mod0SV2)
sum(p.adjust(pvalsSV2, method="BH") < 0.05)
[1] 0

#For SV3
modSV3 <- cbind(mod_ND_T1D, sv3$sv)
mod0SV3 <- cbind(mod0_ND_T1D, sv3$sv)
pvalsSV3 <- f.pvalue(assays(se.filt_ND_T1D)$logCPM, modSV3, mod0SV3)
sum(p.adjust(pvalsSV3, method="BH") < 0.05)
[1] 315

Adjusting for surrogate variables (SVs) in gene expression analysis did not reveal any differentially expressed genes between pair of conditions except for 315 DE genes between T1D and ND samples, higher than the 230 identified previously at FDR < 5%.

Distribution of raw p-values for an F-test on every gene between T1D vs AAb, AAb vs ND and ND vs T1D samples.

Figure 13: Distribution of raw p-values for an F-test on every gene between T1D vs AAb, AAb vs ND and ND vs T1D samples

We can see that the histograms of the p-value distribution are more uniform now (Figure 13).

2.9.3 Tables of DE genes:

Now we build a table with the differentially expressed (DE) genes having an FDR < 10%, only for the comparison T1D and ND that is the only one that has any DE gene adjusting for surrogate variables (SV).

Finally for the comparison between ND and T1D. We obtain 841 DE genes at FDR < 10%. Below, we present the top-10 genes with the lowest p-values in Table 3.

Table 3: Top-10 differentially expressed genes with lowest p-value between ND and T1D with FDR < 10%. To see the full list of DE genes, follow this link or download this CSV file.
EntrezID Symbol Description P value
1 3134 HLA-F major histocompatibility complex, class I, F 1.00e-07
2 4495 MT1G metallothionein 1G 1.30e-06
3 10670 RRAGA Ras related GTP binding A 1.70e-06
4 9844 ELMO1 engulfment and cell motility 1 1.70e-06
5 6726 SRP9 signal recognition particle 9 2.30e-06
6 23328 SASH1 SAM and SH3 domain containing 1 2.40e-06
7 972 CD74 CD74 molecule 6.60e-06
8 5861 RAB1A RAB1A, member RAS oncogene family 1.06e-05
9 90987 ZNF251 zinc finger protein 251 1.58e-05
10 3033 HADH hydroxyacyl-CoA dehydrogenase 1.78e-05

3 Differential expression

In order to analyze the differential expression (DE) between the three conditions, we will use the Bioconductor package limma, which implements procedures to carry out DE analyses on RNA-seq data implementing the mean-variance relationship into linear regression models. We have analyzed DE without taking into account covariates, taking into account age as a confounding variable and also taking into account age and unknown covariates. Also we have used both the limma-trend and limma-voom pipelines. The results were more or less similar but in our case, the dataset consist of a relatively small number of replicates (8 ND, 7 T1D, and 3 AAb) and minimal variation in library sizes, so we will use limma-trend pipeline.

# T1D vs AAb
se.filt_T1D_AAb$disease_state <- relevel(se.filt_T1D_AAb$disease_state, ref="T1D")
mod1 <- model.matrix(~ disease_state + age, data=colData(se.filt_T1D_AAb))
mod0.1 <- model.matrix(~age, data=colData(se.filt_T1D_AAb))
sv1 <- sva(assays(se.filt_T1D_AAb)$logCPM, mod=mod1, mod0=mod0.1)
Number of significant surrogate variables is:  2 
Iteration (out of 5 ):1  2  3  4  5  
mod1 <- cbind(mod1, sv1$sv)
colnames(mod1) <- c(colnames(mod1)[1:3], paste0("SV", 1:sv1$n))
head(mod1, 3)
            (Intercept) disease_stateAAb   age       SV1        SV2
SRR15372438           1                1 23.00 0.2723930 -0.1973451
SRR15372439           1                1 17.65 0.1995298 -0.1710604
SRR15372440           1                1 22.10 0.2350359 -0.1471492
head(rowData(se.filt_AAb_ND))
DataFrame with 6 rows and 5 columns
              gene_id   gene_biotype            description    gene_id_version
          <character>    <character>            <character>        <character>
100   ENSG00000196839 protein_coding adenosine deaminase .. ENSG00000196839.12
1000  ENSG00000170558 protein_coding cadherin 2 [Source:H..  ENSG00000170558.8
10000 ENSG00000117020 protein_coding AKT serine/threonine.. ENSG00000117020.16
10001 ENSG00000133997 protein_coding mediator complex sub.. ENSG00000133997.11
10003 ENSG00000077616 protein_coding N-acetylated alpha-l.. ENSG00000077616.10
10004 ENSG00000168060 protein_coding N-acetylated alpha-l.. ENSG00000168060.15
           symbol
      <character>
100           ADA
1000         CDH2
10000        AKT3
10001        MED6
10003     NAALAD2
10004    NAALADL1
#Fit the model
fit1 <- lmFit(assays(se.filt_T1D_AAb)$logCPM, mod1)
#Calculate moderate t statistics
fit1 <- eBayes(fit1, trend=TRUE)
#Add gene metadata
genesmd <- data.frame(chr=as.character(seqnames(rowRanges(se.filt_T1D_AAb))),
                       symbol=rowData(se.filt_T1D_AAb)[, 5],
                       stringsAsFactors=FALSE)

fit1$genes <- genesmd
tt1 <- topTable(fit1, coef=2, n=Inf)
DEgenes1 <- rownames(tt1)[tt1$adj.P.Val < 0.1]
length(DEgenes1)
[1] 0
# AAb vs ND
se.filt_AAb_ND$disease_state <- relevel(se.filt_AAb_ND$disease_state, ref = "AAb")
mod2 <- model.matrix(~ disease_state + age, data = colData(se.filt_AAb_ND))
mod0.2 <- model.matrix(~ age, data = colData(se.filt_AAb_ND))
sv2 <- sva(assays(se.filt_AAb_ND)$logCPM, mod = mod2, mod0 = mod0.2)
Number of significant surrogate variables is:  2 
Iteration (out of 5 ):1  2  3  4  5  
mod2 <- cbind(mod2, sv2$sv)
colnames(mod2) <- c(colnames(mod2)[1:3], paste0("SV", 1:sv2$n))
head(mod2, 3)
            (Intercept) disease_stateND   age        SV1        SV2
SRR15372438           1               0 23.00 -0.2577965 -0.1856249
SRR15372439           1               0 17.65 -0.1373232  0.5120628
SRR15372440           1               0 22.10 -0.3087993  0.2482733

# Fit the model
fit2 <- lmFit(assays(se.filt_AAb_ND)$logCPM, mod2)
# Calculate moderated t statistics
fit2 <- eBayes(fit2, trend = TRUE)
# Add gene metadata
genesmd2 <- data.frame(chr = as.character(seqnames(rowRanges(se.filt_AAb_ND))),
                       symbol = rowData(se.filt_AAb_ND)[, 5],
                       stringsAsFactors = FALSE)
fit2$genes <- genesmd2
tt2 <- topTable(fit2, coef = 2, n = Inf)
DEgenes2 <- rownames(tt2)[tt2$adj.P.Val < 0.1]
length(DEgenes2)
[1] 2
# ND vs T1D
se.filt_ND_T1D$disease_state <- relevel(se.filt_ND_T1D$disease_state, ref = "ND")
mod3 <- model.matrix(~ disease_state + age, data = colData(se.filt_ND_T1D))
mod0.3 <- model.matrix(~ age, data = colData(se.filt_ND_T1D))
sv3 <- sva(assays(se.filt_ND_T1D)$logCPM, mod = mod3, mod0 = mod0.3)
Number of significant surrogate variables is:  3 
Iteration (out of 5 ):1  2  3  4  5  
mod3 <- cbind(mod3, sv3$sv)
colnames(mod3) <- c(colnames(mod3)[1:3], paste0("SV", 1:sv3$n))
head(mod3, 3)
            (Intercept) disease_stateT1D  age        SV1        SV2         SV3
SRR15372441           1                0 24.2 -0.1916294 -0.2056604 -0.05426709
SRR15372442           1                0 14.0 -0.0116619  0.2596077  0.23688536
SRR15372443           1                0 20.0 -0.2585012 -0.1912782 -0.12645233

# Fit the model
fit3 <- lmFit(assays(se.filt_ND_T1D)$logCPM, mod3)
# Calculate moderated t statistics
fit3 <- eBayes(fit3, trend = TRUE)
# Add gene metadata
genesmd3 <- data.frame(chr = as.character(seqnames(rowRanges(se.filt_ND_T1D))),
                       symbol = rowData(se.filt_ND_T1D)[, 5],
                       stringsAsFactors = FALSE)
fit3$genes <- genesmd3
tt3 <- topTable(fit3, coef = 2, n = Inf)
DEgenes3 <- rownames(tt3)[tt3$adj.P.Val < 0.1]
length(DEgenes3)
[1] 2019

There are no DE genes between T1D and AAb samples whereas We identified 2 DE genes at FDR < 10% between AAb and ND patients and 2019 DE genes at FDR < 10% between ND and T1D samples. The distribution of the resulting p-values is shown in Figure 14.

Distribution of raw p-values and qq plots of the moderated t statistics for the test on every gene between T1D vs AAb, AAb vs ND and ND vs T1D samples with limma-trend adjusting for known and unknown variables.

Figure 14: Distribution of raw p-values and qq plots of the moderated t statistics for the test on every gene between T1D vs AAb, AAb vs ND and ND vs T1D samples with limma-trend adjusting for known and unknown variables

We can observe in Figure 14 that the distribution of the raw p-values is the expected, since we see the maximum peak at the left end of the distribution, and the rest is more or less uniform in ND vs T1D. On the other side we don’t see that peak in T1D vs AAb and AAb vs ND where distributions are more or less uniform. This is consistent with the fact that the comparison T1D vs AAb has no DE genes and AAb vs ND only two at FDR < 10%.

In the case of the QQplot, we can see that we have not any/many DE genes between AAb and T1D or AAb vs ND as the points in the QQ plot lie close to the reference line (the theoretical line), it indicates that the observed t-statistics follow the expected distribution under the null hypothesis. This suggests that there are no substantial deviations, and the null hypothesis (no differential expression) is mostly true for these comparisons.

In the case of ND vs T1D, in this comparison, the QQ plot shows significant deviations from the theoretical line at both extremes (negative and positive values). This deviation indicates that the observed t-statistics are more extreme than what would be expected under the null hypothesis. A downward deviation at the lower end suggests an excess of negative t-statistics, indicating more genes are significantly downregulated in ND compared to T1D whereas an upward deviation at the upper end suggests an excess of positive t-statistics, indicating more genes are significantly upregulated in ND compared to T1D.

Moreover, we can use volcano plots to examine the extent of DE for a contrast of interest.

Volcano Plots showing DE genes, with upregulated and downregulated genes in the right and left respectively.

Figure 15: Volcano Plots showing DE genes, with upregulated and downregulated genes in the right and left respectively

We can se that no DE genes were found between AAb and T1D in Figure 15. In the case of ND vs AAb, there is only two DE genes, one downregulated and the other upregulated. On the other hand, more than 2000 genes have a differential expression in the ND vs T1D condition. In Figure 14 we can see the top 10 genes names.

4 Functional analysis

4.1 Gene Ontology Analysis

Our next goal is to contextualize the differentially expressed genes identified in the previous section to determine which biological processes are involved. A common method for functional enrichment analysis involves conducting a Gene Ontology (GO) analysis, which typically entails applying the one-tailed Fisher’s exact test to each gene set in the GO database. It will be done only for the conditions T1D vs ND as the others present a few or no DE genes.

library(org.Hs.eg.db)
library(GOstats)

# Define geneUniverse as all genes
geneUniverse <- rownames(se)

# ND vs T1D (tt3)

# Build parameters object
params3 <- new("GOHyperGParams", geneIds=DEgenes3,    
                 universeGeneIds=geneUniverse,
                 annotation="org.Hs.eg.db", ontology="BP",
                 pvalueCutoff=0.05, testDirection="over")

#Run functional enrichment
hgOver3 <- hyperGTest(params3)
hgOver3
Gene to GO BP  test for over-representation 
9504 GO BP ids tested (1718 have p < 0.05)
Selected gene set size: 1850 
    Gene universe size: 18098 
    Annotation package: org.Hs.eg 

# Results
resHgOver3<- summary(hgOver3)
dim(resHgOver3)
[1] 1718    7
head(resHgOver3)
      GOBPID       Pvalue OddsRatio ExpCount Count Size
1 GO:0051179 3.116919e-25  1.702443 540.4437   737 5287
2 GO:0051641 3.196177e-25  1.818155 343.3611   515 3359
3 GO:0070727 4.513427e-25  1.924526 255.9620   410 2504
4 GO:0008104 6.806964e-25  1.921959 254.8376   408 2493
5 GO:0051234 2.127719e-23  1.692510 475.7377   658 4654
6 GO:0045184 2.146591e-22  2.036221 172.9583   297 1692
                                   Term
1                          localization
2                 cellular localization
3   cellular macromolecule localization
4                  protein localization
5         establishment of localization
6 establishment of protein localization
Distribution of p-values of the GO enrichment analysis.

Figure 16: Distribution of p-values of the GO enrichment analysis

We can se in Figure 16 that the P-value distribution is enriched towards low P-values, which often makes this type of analysis too optimistic. A challenge in Gene Ontology (GO) analysis arises from the interdependence of terms due to their hierarchical structure and overlapping gene sets. To address this, Alexa, Rahnenführer, and Lengauer (2006) propose computing the significance of a GO term conditional on the significance of its children. This involves performing a bottom-up conditional test, where genes associated with a significant child term are removed from its parent terms. This approach prioritizes more specific and relevant terms in the analysis.

# ND vs T1D (tt3)
conditional(params3) <- TRUE
hgOverCond3 <- hyperGTest(params3)
hgOverCond3
Gene to GO BP Conditional test for over-representation 
9504 GO BP ids tested (973 have p < 0.05)
Selected gene set size: 1850 
    Gene universe size: 18098 
    Annotation package: org.Hs.eg 

#into a df
resHgOverCond3 <- summary(hgOverCond3)
dim(resHgOverCond3)
[1] 973   7
head(resHgOverCond3)
      GOBPID       Pvalue OddsRatio  ExpCount Count Size
1 GO:0033036 8.637072e-21  1.754928 305.33484   453 2987
2 GO:0032879 6.837364e-14  1.695413 202.30631   302 1985
3 GO:0051641 2.871855e-11  1.784969 121.31310   192 1272
4 GO:0051649 5.145642e-11  1.641630 172.84255   254 1713
5 GO:0071840 9.318193e-10  1.351808 685.18897   805 6703
6 GO:0071692 5.558165e-09  2.262064  38.63963    76  378
                                           Term
1                    macromolecule localization
2                    regulation of localization
3                         cellular localization
4         establishment of localization in cell
5 cellular component organization or biogenesis
6  protein localization to extracellular region

#Filter with a minimum number of count and size
mask <- resHgOverCond3$Size >= 3 & resHgOverCond3$Size <= 300 & + resHgOverCond3$Count >= 3
resHgOverCond3 <- resHgOverCond3[mask, ]
dim(resHgOverCond3)
[1] 702   7
ord <- order(resHgOverCond3$OddsRatio, decreasing=TRUE)
resHgOverCond3 <- resHgOverCond3[ord, ]

#Extratc genes of each GO term
geneIDs <- geneIdsByCategory(hgOverCond3)[resHgOverCond3$GOBPID]
geneSYMs <- sapply(geneIDs, function(id) mapIds(org.Hs.eg.db, keys = id, column = "SYMBOL", keytype = "ENTREZID"))
geneSYMs <- sapply(geneSYMs, paste, collapse=", ")
goresults <- cbind(resHgOverCond3, Genes=geneSYMs)
rownames(goresults) <- 1:nrow(goresults)
head(goresults)
      GOBPID       Pvalue OddsRatio  ExpCount Count Size
1 GO:0002477 1.066574e-03       Inf 0.3066637     3    3
2 GO:0006613 4.931174e-04  35.34241 0.5093233     4    5
3 GO:0002502 4.999107e-04  35.20477 0.5111062     4    5
4 GO:0030070 2.641905e-05  26.43059 0.8177699     6    8
5 GO:0014809 3.939691e-03  26.38928 0.4088850     3    4
6 GO:0019249 3.939691e-03  26.38928 0.4088850     3    4
                                                                                           Term
1             antigen processing and presentation of exogenous peptide antigen via MHC class Ib
2                                                 cotranslational protein targeting to membrane
3                                     peptide antigen assembly with MHC class I protein complex
4                                                                            insulin processing
5 regulation of skeletal muscle contraction by regulation of release of sequestered calcium ion
6                                                                  lactate biosynthetic process
                                    Genes
1                      HLA-E, HLA-F, TAP2
2               ARL6IP1, SIL1, SSR1, SSR2
3             PDIA3, HLA-A, TAPBPL, TAPBP
4 CPE, SLC30A8, P4HB, PCSK1, ERO1B, YIPF5
5                       DMD, GSTM2, GSTO1
6                      PARK7, GATD1, PER2

ktab <- kable(goresults, "html", caption="GO results.") %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive"), fixed_thead = TRUE)
fnameHTML_ND_D <- "ND_T1D_go.html"
fpathHTML_ND_D <- file.path(path2pkg, "doc", fnameHTML_ND_D)
save_kable(ktab, file = fpathHTML_ND_D, self_contained = TRUE)
Distribution of p-values of the GO enrichment analysis after conditionaltest.

Figure 17: Distribution of p-values of the GO enrichment analysis after conditionaltest

The number of significant GO terms has decreased and the histogram of Figure 17 shows that it is now less biased to low pvalues after doing the conditional test.

For a better representation of the topGO terms we visualize them using a Dotplot.

Dotplot of the top 25 GO enriched terms in T1D vs ND comparison.

Figure 18: Dotplot of the top 25 GO enriched terms in T1D vs ND comparison

We can see in 18 several significantly enriched biological processes. Antigen processing and presentation, both endogenous and exogenous, highlight the autoimmune nature of type 1 diabetes, leading to beta-cell destruction. Processes like gluconeogenesis, glucose homeostasis, and hexose metabolism underscore metabolic dysregulation, resulting in hyperglycemia. Protein-related processes, including folding in the ER, localization to membranes, and positive regulation of protein transport and secretion, emphasize the cellular stress and impaired insulin production characteristic of type 1.

4.2 Gene Set Enrichment Analysis (GSEA)

In addition to performing functional analysis with Gene Ontology (GO), we employed the Gene Set Enrichment Analysis (GSEA) approach to assess pathway enrichment for each gene set. GSEA calculates an enrichment score (ES) based on the position of genes within a ranked list of changes in gene expression. Significance then, is calculated by generating an empirical null distribution of ES values and either (1) permute phenotype labels; or (2) permute gene set memberships. In our case we will permute gene set memberships. Permuting gene set memberships is a valuable approach for evaluating differences between pairs of conditions in gene set enrichment analysis. This method allows us to identify gene sets whose functions or characteristics are inherently associated with differential expression patterns across conditions, providing insights into the underlying biological processes driving these differences.

GSEA is particularly adept at detecting subtle yet coordinated changes in gene expression, offering insights into the underlying molecular mechanisms implicated in the disease. This approach enables us to elucidate the pathways and biological processes disrupted in diabetes, even in instances where highly significant differentially expressed genes may be lacking in certain conditions.

To conduct the GSEA analysis, we need a collection of gene sets. For this purpose, we will use the GeneSetCollection called c2BroadSets from the GSVA, which includes curated gene sets from various sources such as online pathway databases and biomedical literature. Specifically, we will use the KEGG, REACTOME, and BIOCARTA pathways from this GeneSetCollection, which are well-regarded databases providing curated information about biological pathways and networks.

library(GSVAdata)

# C2 Curated Gene Sets
data(c2BroadSets)
c2BroadSets <- c2BroadSets[c(grep("^KEGG", names(c2BroadSets)),
               grep("^REACTOME", names(c2BroadSets)), grep("^BIOCARTA", names(c2BroadSets)))]

gsc<-GeneSetCollection(c2BroadSets)
library(fgsea)

gsets <- geneIds(gsc)

#AAb vs T1D
stats1 <- tt1$t
names(stats1) <- rownames(tt1)
fgseares1 <- fgsea(gsets, stats1, minSize=5, maxSize=300)

#AAb vs ND
stats2 <- tt2$t
names(stats2) <- rownames(tt2)
fgseares2 <- fgsea(gsets, stats2, minSize=5, maxSize=300)

#T1D vs ND
stats3 <- tt3$t
names(stats3) <- rownames(tt3)
fgseares3 <- fgsea(gsets, stats3, minSize=5, maxSize=300)

After the analysis, we filter the resuls based on a FDR threshold of less than 1%.

The enriched pathways with a FDR threshold of less than 1% between AAb and T1D samples are shown in Table 4 below:

Table 4: Enriched Pathways between AAb and T1D samples
Pathway PValue PadjValue ES NES Size LeadingEdge
KEGG_PROTEIN_EXPORT 6.90e-06 0.0019422 0.6959011 2.265054 23 90701, 5….
REACTOME_DOWNSTREAM_EVENTS_IN_GPCR_SIGNALING 7.20e-06 0.0019422 -0.3800618 -1.797856 180 2786, 57….
REACTOME_TRNA_AMINOACYLATION 5.00e-06 0.0019422 0.5975938 2.225263 39 57176, 6….
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION 1.14e-05 0.0022942 0.3756189 1.839260 150 5223, 21….
REACTOME_MITOTIC_M_M_G1_PHASES 2.32e-05 0.0037502 0.3935004 1.851023 121 5714, 63….
KEGG_PARKINSONS_DISEASE 2.80e-05 0.0037630 0.4040849 1.889321 114 2861, 29….
KEGG_MELANOGENESIS 4.87e-05 0.0056130 -0.5057011 -1.987588 57 7482, 19….

Now we represent the enriched pathways between AAb and T1D in a dot plot (Figure 19).

Dotplot of the top 25 GSEA enriched pathways between AAb vs T1D.

Figure 19: Dotplot of the top 25 GSEA enriched pathways between AAb vs T1D

The enriched pathways with a FDR threshold of less than 1% between AAb and ND samples are shown in Table 5 below:

Table 5: Enriched Pathways between AAb and ND samples
Pathway PValue PadjValue ES NES Size LeadingEdge
KEGG_RIBOSOME 0.0000000 0.0000000 0.8064631 3.452906 79 6206, 62….
REACTOME_PEPTIDE_CHAIN_ELONGATION 0.0000000 0.0000000 0.8085119 3.453679 77 6206, 62….
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 0.0000000 0.0000000 0.7720358 3.355927 87 6206, 62….
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 0.0000000 0.0000000 0.7690621 3.392994 91 6206, 38….
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS 0.0000000 0.0000000 0.7651946 3.375931 91 6206, 62….
REACTOME_TRANSLATION 0.0000000 0.0000000 0.7263010 3.326538 110 6206, 62….
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT 0.0000000 0.0000000 0.7455611 3.327478 97 6206, 62….
REACTOME_VIRAL_MRNA_TRANSLATION 0.0000000 0.0000000 0.7956518 3.398745 77 6206, 62….
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT 0.0000000 0.0000000 0.7346015 3.303028 100 6206, 62….
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION 0.0000000 0.0000000 0.6812328 3.147517 122 6206, 62….
REACTOME_INFLUENZA_LIFE_CYCLE 0.0000000 0.0000000 0.6296879 2.920051 128 6206, 38….
REACTOME_METABOLISM_OF_PROTEINS 0.0000000 0.0000000 0.5606133 2.764842 193 6206, 78….
KEGG_OXIDATIVE_PHOSPHORYLATION 0.0000000 0.0000000 0.6286930 2.889053 112 4537, 47….
REACTOME_SIGNALING_IN_IMMUNE_SYSTEM 0.0000000 0.0000000 -0.5097911 -2.491415 203 3122, 71….
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION 0.0000000 0.0000000 0.5481320 2.630998 150 4191, 45….
REACTOME_ELECTRON_TRANSPORT_CHAIN 0.0000000 0.0000000 0.6769859 2.840892 71 4537, 47….
KEGG_PARKINSONS_DISEASE 0.0000000 0.0000000 0.5707364 2.621775 114 4537, 47….
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM 0.0000000 0.0000000 0.4692747 2.325461 206 2786, 41….
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX 0.0000000 0.0000000 0.7466731 2.802486 44 6206, 62….
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION 0.0000000 0.0000000 0.7128553 2.751623 50 6206, 62….
REACTOME_REGULATION_OF_INSULIN_SECRETION 0.0000000 0.0000000 0.4831362 2.373890 185 2786, 41….
KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION 0.0000000 0.0000000 -0.7234985 -2.740426 45 972, 312….
KEGG_ALLOGRAFT_REJECTION 0.0000000 0.0000000 -0.9139404 -2.688664 16 3122, 31….
KEGG_AUTOIMMUNE_THYROID_DISEASE 0.0000000 0.0000000 -0.9139404 -2.688664 16 3122, 31….
KEGG_GRAFT_VERSUS_HOST_DISEASE 0.0000000 0.0000000 -0.9114850 -2.630330 15 3122, 31….
KEGG_VIRAL_MYOCARDITIS 0.0000000 0.0000001 -0.6838371 -2.578884 44 3122, 31….
KEGG_HUNTINGTONS_DISEASE 0.0000000 0.0000001 0.4613188 2.213853 151 4707, 29….
REACTOME_PD1_SIGNALING 0.0000000 0.0000002 -0.8915716 -2.529687 14 3122, 31….
KEGG_TYPE_I_DIABETES_MELLITUS 0.0000000 0.0000004 -0.8028364 -2.550373 22 3122, 31….
REACTOME_SYNTHESIS_OF_DNA 0.0000001 0.0000017 0.5370809 2.297133 78 6119, 23….
REACTOME_COSTIMULATION_BY_THE_CD28_FAMILY 0.0000001 0.0000019 -0.6420463 -2.439233 46 3122, 31….
KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS 0.0000001 0.0000031 -0.6325766 -2.420948 47 3122, 71….
REACTOME_PHOSPHORYLATION_OF_CD3_AND_TCR_ZETA_CHAINS 0.0000002 0.0000044 -0.8709451 -2.445945 13 3122, 31….
KEGG_CELL_ADHESION_MOLECULES_CAMS 0.0000002 0.0000057 -0.5362391 -2.228927 76 3122, 31….
REACTOME_S_PHASE 0.0000003 0.0000063 0.5026120 2.196143 89 6119, 23….
REACTOME_TCR_SIGNALING 0.0000006 0.0000130 -0.6389978 -2.359680 41 3122, 31….
KEGG_ASTHMA 0.0000009 0.0000201 -0.8654647 -2.367582 12 3122, 31….
KEGG_INTESTINAL_IMMUNE_NETWORK_FOR_IGA_PRODUCTION 0.0000009 0.0000201 -0.7592434 -2.382771 21 3122, 31….
REACTOME_G1_S_TRANSITION 0.0000016 0.0000329 0.4864347 2.085497 83 6119, 23….
REACTOME_DOWNSTREAM_TCR_SIGNALING 0.0000019 0.0000373 -0.6926346 -2.371359 29 3122, 31….
REACTOME_TRANSLOCATION_OF_ZAP70_TO_IMMUNOLOGICAL_SYNAPSE 0.0000020 0.0000388 -0.8934702 -2.341962 10 3122, 31….
KEGG_ALZHEIMERS_DISEASE 0.0000022 0.0000428 0.4175936 1.956413 136 7132, 47….
REACTOME_GENERATION_OF_SECOND_MESSENGER_MOLECULES 0.0000028 0.0000529 -0.7700026 -2.368328 19 3122, 31….
REACTOME_DNA_REPLICATION_PRE_INITIATION 0.0000053 0.0000981 0.5063248 2.093552 67 6119, 23….
KEGG_LEISHMANIA_INFECTION 0.0000063 0.0001125 -0.5831280 -2.208739 45 3122, 31….
REACTOME_ACTIVATION_OF_ATR_IN_RESPONSE_TO_REPLICATION_STRESS 0.0000098 0.0001712 0.6709400 2.259362 27 6119, 23….
REACTOME_G2_M_CHECKPOINTS 0.0000173 0.0002979 0.6221570 2.167031 31 6119, 23….
KEGG_COMPLEMENT_AND_COAGULATION_CASCADES 0.0000184 0.0003096 -0.5810349 -2.180938 43 713, 712….
REACTOME_IMMUNOREGULATORY_INTERACTIONS_BETWEEN_A_LYMPHOID_AND_A_NON_LYMPHOID_CELL 0.0000307 0.0005059 -0.6799889 -2.187130 24 2214, 31….
BIOCARTA_BLYMPHOCYTE_PATHWAY 0.0000318 0.0005139 -0.9326611 -2.031101 6 3122, 31….
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN 0.0000335 0.0005299 0.5194842 2.081748 57 23595, 1….
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING 0.0000372 0.0005776 0.7491271 2.208718 15 9551, 45….
REACTOME_CELL_CYCLE_CHECKPOINTS 0.0000448 0.0006819 0.4353824 1.932005 93 6119, 23….
BIOCARTA_COMP_PATHWAY 0.0000603 0.0008855 -0.8456689 -2.216665 10 713, 712….
REACTOME_COMPLEMENT_CASCADE 0.0000603 0.0008855 -0.8456689 -2.216665 10 713, 712….
REACTOME_INITIAL_TRIGGERING_OF_COMPLEMENT 0.0000664 0.0009567 -0.8953598 -2.087885 7 713, 712….
KEGG_DNA_REPLICATION 0.0000795 0.0011258 0.5941479 2.089992 33 6119, 59….
BIOCARTA_TCRA_PATHWAY 0.0001105 0.0015136 -0.9348933 -1.933580 5 3122, 31….
REACTOME_CELL_CYCLE_MITOTIC 0.0001107 0.0015136 0.3141142 1.606072 244 9183, 61….
BIOCARTA_CLASSIC_PATHWAY 0.0001153 0.0015507 -0.8483118 -2.142589 9 713, 712….
REACTOME_HEMOSTASIS 0.0001216 0.0016090 -0.3560318 -1.716927 187 3708, 33….
REACTOME_CLASSICAL_ANTIBODY_MEDIATED_COMPLEMENT_ACTIVATION 0.0001409 0.0018336 -0.9324302 -1.928486 5 713, 712….
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_ 0.0001595 0.0020426 0.4967815 1.948640 53 1017, 59….
REACTOME_M_G1_TRANSITION 0.0002867 0.0036145 0.4890226 1.940914 55 23595, 5….
BIOCARTA_TH1TH2_PATHWAY 0.0003363 0.0041755 -0.8957833 -1.950790 6 3122, 31….
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G 0.0003630 0.0044389 0.5156384 1.959590 47 5705, 56….
REACTOME_PECAM1_INTERACTIONS 0.0004006 0.0048255 -0.8237554 -2.080567 9 4067, 25….
KEGG_NATURAL_KILLER_CELL_MEDIATED_CYTOTOXICITY 0.0004255 0.0050501 -0.4585651 -1.852614 62 2214, 31….
REACTOME_REPAIR_SYNTHESIS_OF_PATCH_27_30_BASES_LONG_BY_DNA_POLYMERASE 0.0005044 0.0058988 0.7119880 2.028239 14 6119, 59….
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21 0.0005245 0.0060465 0.4930899 1.884001 48 1017, 65….
REACTOME_PHASE_1_FUNCTIONALIZATION 0.0006601 0.0075028 -0.9040343 -1.869757 5 4129, 69….
REACTOME_INTEGRIN_CELL_SURFACE_INTERACTIONS 0.0007459 0.0083600 -0.4604104 -1.846990 61 3680, 12….
REACTOME_MRNA_SPLICING_MINOR_PATHWAY 0.0009015 0.0099654 0.5125211 1.886393 40 6637, 10….

Now we represent the top 25 enriched pathways between AAb and ND in a dot plot (Figure 20).

Dotplot of the top 25 GSEA enriched pathways between AAb and ND.

Figure 20: Dotplot of the top 25 GSEA enriched pathways between AAb and ND

The enriched pathways with a FDR threshold of less than 1% between T1D and ND samples are shown in Table 6 below:

Table 6: Enriched Pathways between T1D and ND samples
Pathway PValue PadjValue ES NES Size LeadingEdge
KEGG_OXIDATIVE_PHOSPHORYLATION 0.0000000 0.0000000 -0.6137227 -2.703860 112 51382, 5….
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION 0.0000000 0.0000000 -0.5591489 -2.574392 150 6514, 41….
REACTOME_REGULATION_OF_INSULIN_SECRETION 0.0000000 0.0000000 -0.5278999 -2.498343 185 6514, 41….
KEGG_PARKINSONS_DISEASE 0.0000000 0.0000000 -0.5779166 -2.561744 114 7345, 46….
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM 0.0000000 0.0000000 -0.4857029 -2.317466 206 6514, 26….
REACTOME_SIGNALING_IN_IMMUNE_SYSTEM 0.0000000 0.0000000 0.4473025 2.320561 203 3106, 31….
KEGG_ALLOGRAFT_REJECTION 0.0000000 0.0000000 0.9043587 2.719197 16 3106, 31….
KEGG_AUTOIMMUNE_THYROID_DISEASE 0.0000000 0.0000000 0.9043587 2.719197 16 3106, 31….
KEGG_GRAFT_VERSUS_HOST_DISEASE 0.0000000 0.0000000 0.9004719 2.677130 15 3106, 31….
KEGG_VIRAL_MYOCARDITIS 0.0000000 0.0000001 0.6704811 2.623891 44 3106, 31….
REACTOME_CLASSICAL_ANTIBODY_MEDIATED_COMPLEMENT_ACTIVATION 0.0000000 0.0000002 0.9920361 2.062466 5 713, 716….
KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION 0.0000000 0.0000002 0.6524344 2.569837 45 972, 310….
REACTOME_ELECTRON_TRANSPORT_CHAIN 0.0000000 0.0000004 -0.5868998 -2.375017 71 4698, 54….
KEGG_LEISHMANIA_INFECTION 0.0000000 0.0000005 0.6404309 2.522557 45 3122, 31….
KEGG_INTESTINAL_IMMUNE_NETWORK_FOR_IGA_PRODUCTION 0.0000000 0.0000005 0.7889566 2.584008 21 3122, 31….
REACTOME_IMMUNOREGULATORY_INTERACTIONS_BETWEEN_A_LYMPHOID_AND_A_NON_LYMPHOID_CELL 0.0000000 0.0000008 0.7606774 2.555285 24 3106, 31….
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION 0.0000000 0.0000008 -0.4946990 -2.223067 122 90701, 6….
KEGG_CELL_ADHESION_MOLECULES_CAMS 0.0000000 0.0000015 0.5402633 2.374274 76 3106, 31….
KEGG_NATURAL_KILLER_CELL_MEDIATED_CYTOTOXICITY 0.0000000 0.0000015 0.5689888 2.403614 62 3106, 31….
KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION 0.0000000 0.0000017 0.5101788 2.292957 87 1436, 14….
REACTOME_MITOTIC_M_M_G1_PHASES 0.0000002 0.0000093 -0.4692216 -2.098299 121 5684, 63….
KEGG_HUNTINGTONS_DISEASE 0.0000003 0.0000096 -0.4418634 -2.040986 151 5332, 46….
KEGG_ALZHEIMERS_DISEASE 0.0000004 0.0000127 -0.4519475 -2.052732 136 5332, 46….
REACTOME_CELL_CYCLE_MITOTIC 0.0000005 0.0000154 -0.3793156 -1.862349 244 7846, 61….
KEGG_PROTEIN_EXPORT 0.0000010 0.0000307 -0.7323547 -2.292166 23 90701, 6….
REACTOME_CELL_CYCLE_CHECKPOINTS 0.0000012 0.0000380 -0.4761054 -2.039567 93 6119, 56….
REACTOME_G1_S_TRANSITION 0.0000016 0.0000476 -0.5022694 -2.104512 83 6119, 56….
KEGG_ERBB_SIGNALING_PATHWAY 0.0000024 0.0000686 0.5021515 2.168703 71 2064, 20….
REACTOME_DNA_REPLICATION_PRE_INITIATION 0.0000025 0.0000686 -0.5234875 -2.083802 67 6119, 56….
REACTOME_TCR_SIGNALING 0.0000031 0.0000845 0.5915814 2.268108 41 3122, 31….
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A 0.0000038 0.0000995 -0.5513104 -2.135964 58 5684, 57….
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE 0.0000043 0.0001080 -0.5195882 -2.057018 65 5684, 57….
KEGG_N_GLYCAN_BIOSYNTHESIS 0.0000055 0.0001333 -0.5924605 -2.148438 42 1603, 85….
BIOCARTA_ASBCELL_PATHWAY 0.0000062 0.0001454 0.9606235 1.997158 5 3122, 31….
REACTOME_SYNTHESIS_OF_DNA 0.0000063 0.0001454 -0.5010712 -2.088059 78 6119, 56….
REACTOME_S_PHASE 0.0000077 0.0001719 -0.4774796 -2.022385 89 6119, 56….
KEGG_PATHWAYS_IN_CANCER 0.0000082 0.0001740 0.3294146 1.728251 224 27148, 1….
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE 0.0000080 0.0001740 -0.5771810 -2.118154 46 5684, 57….
KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS 0.0000096 0.0001994 0.5496324 2.174887 47 3122, 31….
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE 0.0000116 0.0002331 -0.5806178 -2.105493 42 5684, 57….
KEGG_BASAL_CELL_CARCINOMA 0.0000139 0.0002541 0.6686341 2.205788 22 27148, 2….
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX 0.0000137 0.0002541 -0.5636963 -2.100732 48 5684, 57….
REACTOME_GENERATION_OF_SECOND_MESSENGER_MOLECULES 0.0000130 0.0002541 0.7356972 2.335570 19 3122, 31….
REACTOME_PHOSPHORYLATION_OF_CD3_AND_TCR_ZETA_CHAINS 0.0000136 0.0002541 0.8007130 2.211635 13 3122, 31….
REACTOME_STABILIZATION_OF_P53 0.0000149 0.0002619 -0.5655961 -2.075640 46 5684, 57….
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G 0.0000147 0.0002619 -0.5651471 -2.094956 47 5684, 57….
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC 0.0000157 0.0002701 -0.5397576 -2.069112 56 5684, 57….
BIOCARTA_CLASSIC_PATHWAY 0.0000194 0.0003193 0.8685597 2.191558 9 713, 716….
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1 0.0000192 0.0003193 -0.5614160 -2.081125 47 5684, 57….
KEGG_ASTHMA 0.0000206 0.0003330 0.8070557 2.213283 12 3122, 31….
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_ 0.0000226 0.0003398 -0.5361763 -2.053075 53 5684, 57….
REACTOME_M_G1_TRANSITION 0.0000226 0.0003398 -0.5410323 -2.080309 55 5684, 57….
REACTOME_TRANSLOCATION_OF_ZAP70_TO_IMMUNOLOGICAL_SYNAPSE 0.0000227 0.0003398 0.8508645 2.215076 10 3122, 31….
REACTOME_TRNA_AMINOACYLATION 0.0000220 0.0003398 -0.5934683 -2.122662 39 6897, 26….
BIOCARTA_PROTEASOME_PATHWAY 0.0000241 0.0003537 -0.7386208 -2.210451 19 5684, 61….
REACTOME_METABOLISM_OF_AMINO_ACIDS 0.0000255 0.0003670 -0.4175072 -1.880223 127 384, 494….
REACTOME_INITIAL_TRIGGERING_OF_COMPLEMENT 0.0000306 0.0004329 0.9058373 2.084542 7 713, 716….
BIOCARTA_BLYMPHOCYTE_PATHWAY 0.0000345 0.0004591 0.9180079 1.993582 6 3122, 36….
KEGG_ENDOCYTOSIS 0.0000340 0.0004591 0.3573656 1.779197 154 3106, 31….
KEGG_TYPE_I_DIABETES_MELLITUS 0.0000338 0.0004591 0.6504828 2.145908 22 3106, 31….
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21 0.0000347 0.0004591 -0.5458952 -2.034392 48 5684, 57….
REACTOME_METABOLISM_OF_PROTEINS 0.0000461 0.0006005 -0.3684086 -1.756189 193 6903, 78….
REACTOME_SIGNALING_BY_WNT 0.0000469 0.0006009 -0.5155486 -1.997411 58 5684, 57….
BIOCARTA_INFLAM_PATHWAY 0.0000682 0.0008594 0.8322496 2.166616 10 3122, 31….
BIOCARTA_MAPK_PATHWAY 0.0000727 0.0009032 0.4244014 1.873016 79 4214, 13….
KEGG_PROTEASOME 0.0000773 0.0009453 -0.5651386 -2.044756 41 5684, 57….
BIOCARTA_COMP_PATHWAY 0.0000812 0.0009636 0.8287682 2.157552 10 713, 716….
REACTOME_COMPLEMENT_CASCADE 0.0000812 0.0009636 0.8287682 2.157552 10 713, 716….
REACTOME_INNATE_IMMUNITY_SIGNALING 0.0000908 0.0010622 0.4182879 1.861669 82 713, 716….
REACTOME_CYTOSOLIC_TRNA_AMINOACYLATION 0.0001079 0.0012437 -0.6527627 -2.043055 23 6897, 26….
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN 0.0001119 0.0012547 -0.5015174 -1.937437 57 5684, 57….
REACTOME_PD1_SIGNALING 0.0001106 0.0012547 0.7451292 2.148040 14 3122, 31….
KEGG_MAPK_SIGNALING_PATHWAY 0.0001343 0.0014848 0.3253598 1.665817 187 2263, 42….
KEGG_AMINOACYL_TRNA_BIOSYNTHESIS 0.0001496 0.0016311 -0.5491910 -1.967733 40 6897, 26….
REACTOME_METABOLISM_OF_CARBOHYDRATES 0.0001620 0.0017426 -0.4262994 -1.837159 95 6514, 41….
KEGG_COMPLEMENT_AND_COAGULATION_CASCADES 0.0001810 0.0019224 0.5141089 1.994053 43 710, 713….
BIOCARTA_CD40_PATHWAY 0.0001990 0.0020750 0.7449371 2.057577 13 958, 421….
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS 0.0002006 0.0020750 -0.4025476 -1.783636 113 9844, 56….
REACTOME_FACILITATIVE_NA_INDEPENDENT_GLUCOSE_TRANSPORTERS 0.0002354 0.0024051 -0.8450700 -1.980798 8 6514, 15….
REACTOME_DOWNSTREAM_TCR_SIGNALING 0.0002582 0.0026047 0.5765914 2.009129 29 3122, 31….
REACTOME_GLUCONEOGENESIS 0.0002906 0.0028954 -0.5999420 -1.933853 26 4191, 52….
REACTOME_GLUCOSE_METABOLISM 0.0003022 0.0029742 -0.5151851 -1.909751 47 4191, 52….
REACTOME_RHO_GTPASE_CYCLE 0.0003076 0.0029906 0.3747026 1.720242 96 4952, 36….
REACTOME_AXON_GUIDANCE 0.0003398 0.0032643 0.3495046 1.682180 128 54209, 5….
BIOCARTA_LAIR_PATHWAY 0.0004094 0.0038867 0.7947663 2.069035 10 3689, 71….
REACTOME_HIV_INFECTION 0.0004181 0.0039232 -0.3520578 -1.653085 172 9844, 56….
REACTOME_ACTIVATION_OF_CHAPERONES_BY_IRE1_ALPHA 0.0004449 0.0041270 -0.8005015 -1.996247 10 10130, 4….
BIOCARTA_TCRA_PATHWAY 0.0005032 0.0046150 0.8973349 1.865580 5 3122, 31….
REACTOME_TELOMERE_MAINTENANCE 0.0005958 0.0054027 -0.5079837 -1.842100 42 3015, 61….
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING 0.0006193 0.0055531 -0.7067907 -1.950160 15 515, 516….
KEGG_VIBRIO_CHOLERAE_INFECTION 0.0006399 0.0056744 -0.5107938 -1.859700 43 51382, 5….
BIOCARTA_TNFR2_PATHWAY 0.0006677 0.0058565 0.6805632 2.023335 15 4214, 71….
BIOCARTA_EPO_PATHWAY 0.0006837 0.0059330 0.6677861 2.102247 18 3717, 20….
REACTOME_GLYCOLYSIS 0.0006983 0.0059947 -0.6696323 -1.947766 17 5208, 52….
BIOCARTA_IL3_PATHWAY 0.0007130 0.0060572 0.7161174 1.977975 13 3717, 66….
KEGG_HEMATOPOIETIC_CELL_LINEAGE 0.0007761 0.0065240 0.5384894 1.940742 31 3122, 14….
REACTOME_MITOTIC_PROMETAPHASE 0.0008870 0.0073793 -0.4447900 -1.755046 64 6396, 55….
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC 0.0009741 0.0080215 -0.6078353 -1.885981 22 7846, 74….
KEGG_CITRATE_CYCLE_TCA_CYCLE 0.0011144 0.0090840 -0.5486226 -1.826113 29 4191, 41….
BIOCARTA_GLEEVEC_PATHWAY 0.0012134 0.0097923 0.5666274 1.869273 22 4214, 37….

Finally we represent the top 25 enriched pathways between T1D and ND in a dot plot (Figure 21).

Dotplot of the top 25 GSEA enriched pathways between T1D and ND.

Figure 21: Dotplot of the top 25 GSEA enriched pathways between T1D and ND

5 Discussion

To address the main question of our project, which involved understanding potential differential expression patterns between the studied groups, a methodical and thorough analysis was carried out. In this context, this work revealed no differentially expressed genes between patients suffering from diabetes (T1D) and auto-antibody individuals (AAb), suggesting that the two conditions are similar. On the other hand, only two DE genes were found between AAb and control (ND), in comparison with the six total genes found in the reference study. In our analysis, one gene is upregulated (LINC01099) and one downregulated (CD74). Interestingly, LINC01099 is a non-coding RNA which was not detected in Campbell-Thompson et al. (2021) work. However, previous studies have reported its presence and it was considered to be expressed mainly in β-pancreatic cells (Segerstolpe et al. (2016)). In the case of autoimmune diseases related genes, CD74 stands out. It is also known for being part of the class II major histocompatibility complex (MHC). Our analysis indicated that CD74 expression was significantly higher in both AAb and T1D donors compared to ND donors, which is also consistent with the findings in the reference paper. Furthermore, this gene was previously reported to have increased expression in β-cells coming from T1D patients. This study claims to be the first one detecting increased expression of this gene specifically in AAb donors, which is supported as well by our results.

In the case of T1D and ND, more than 2000 genes were found to be differentially expressed, in comparison with the 89 DE genes shown in the paper. Between the top genes we found several autoimmune disease-related genes. As expected, genes implicated in endrocine system development, regulation and insulin secretion were also significant, giving raise to an enrichment in pathways of the same characteristics. Other genes were found, such as MT1G, which translates into a metallothionein 1 (MT1) protein that binds a variety of metals. MT1G has been previously reported in β-cell biology as a negative regulator of glucose-stimulated insulin secretion (Bensellam, Laybutt, and Jonas (2021)).

Furthermore, a relation between neurological-related pathways and type-1 diabetes is highlighted. In this context, some DE genes found in our analyses include NEURL3, LRRTM3 and NPTX2. Interestingly, the latter, NPTX2 (Neuronal Pentraxin 2 gene, down-regulated), becomes progressively decreased with age (Soares Bispo Santos Silva et al. (2015)). Decreased levels of this modulator are associated with synapse loss and therefore, cognitive decline. It is essential for synaptic plasticity and inhibitory-excitatory balance in the central nervous system (Steenoven et al. (2020)). In this context, NPTX2 levels are also reduced in diabetic β-cells, as supported by previous findings, which is also shown in our report. In the same tone, several studies have reported strong relationship between diabetes disease and Alzheimer’s disease (which is, in fact, considered a comorbid condition with diabetes) (Alamro et al. (2023)), as well as Parkinson’s disease. Indeed, two of the KEGG enriched pathways found significant in our dataset are related to these two neuropathologies (KEGG_ALZHEIMERS_DISEASE, KEGG_PARKINSONS_DISEASE).

However, we were not able to replicate all the enriched pathways related to neurological or synaptic processes such as Dopamine Receptor Signaling or α-Adrenergic Signaling highlighted in the article, which may be due to the fact that different approaches were chosen for the enrichment analysis. Nevertheless, neurological processes are undoubtedly affected in this disease.

Some of the differences seen when obtaining the results in comparison with the reference paper could be due to the fact that authors did not filter lowly expressed genes, and apparently did not use covariates as well. In this sense, our analysis involved examining how samples group using multidimensional scaling (MDS) and hierarchical clustering, annotated by disease state, sex, BMI, age, and cause of death. In the MDS analysis, a prominent clustering pattern with a main cluster and three outliers, including sample T1D_6109 (a 5-year-old) was observed. Moreover, no clear clustering based on the variables examined was identified. The distinct outlier suggests significant deviations in expression profiles, warranting further investigation into potential age-related expression patterns or technical artifacts.

According to the Hierarchical Clustering Analysis, no distinct clustering pattern emerged when using sex, indicating no segregation based on gender. In the case of BMI, some clustering was observed, particularly with a concentration of underweight samples, but other BMI categories did not show clear clusters, suggesting additional contributing factors.

Interestingly, a discernible age-based clustering pattern was found, indicating age might be a confounding variable influencing expression profiles. That is why we suggest careful consideration to avoid bias, though further validation is needed to confirm the potential effect age may have in this disease.

6 Conclusions

-The analysis revealed no differentially expressed genes between diabetes (T1D) and auto-antibody (AAb) individuals, suggesting that these conditions share similar expression profiles.

-The present findings claim the existence of diverse biological functions affected in diabetes disease.

-Over 2000 genes were differentially expressed between T1D and ND, significantly more than the 89 genes reported in the reference study. Many of these genes are related to autoimmune diseases and endocrine system functions.

-The analysis highlighted the involvement of some neurological pathways in T1D. Enriched pathways included those related to Alzheimer’s and Parkinson’s diseases, underscoring the comorbidity between diabetes and these neurological conditions.

-Differences in results compared to the reference study might stem from filtering lowly expressed genes and incorporating covariates. MDS and hierarchical clustering analyses showed no clear segregation by sex or BMI but revealed an age-based clustering pattern. Age might act as a confounding variable influencing expression profiles, necessitating careful consideration to avoid bias in future analyses.

7 Session information

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Madrid
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] fgsea_1.30.0                GSVAdata_1.40.0            
 [3] hgu95a.db_3.13.0            GSEABase_1.66.0            
 [5] dplyr_1.1.4                 GO.db_3.19.1               
 [7] GOstats_2.70.0              graph_1.82.0               
 [9] Category_2.70.0             Matrix_1.6-5               
[11] org.Hs.eg.db_3.19.1         ggplot2_3.5.1              
[13] gridExtra_2.3               sva_3.52.0                 
[15] BiocParallel_1.38.0         genefilter_1.86.0          
[17] mgcv_1.9-1                  nlme_3.1-163               
[19] geneplotter_1.82.0          annotate_1.82.0            
[21] XML_3.99-0.16.1             AnnotationDbi_1.66.0       
[23] lattice_0.22-5              edgeR_4.2.0                
[25] limma_3.60.2                SummarizedExperiment_1.34.0
[27] Biobase_2.64.0              GenomicRanges_1.56.0       
[29] GenomeInfoDb_1.40.1         IRanges_2.38.0             
[31] S4Vectors_0.42.0            BiocGenerics_0.50.0        
[33] MatrixGenerics_1.16.0       matrixStats_1.3.0          
[35] usethis_2.2.3               here_1.0.1                 
[37] kableExtra_1.4.0            knitr_1.47                 
[39] BiocStyle_2.32.0           

loaded via a namespace (and not attached):
 [1] bitops_1.0-7            DBI_1.2.2               RBGL_1.80.0            
 [4] rlang_1.1.3             magrittr_2.0.3          compiler_4.4.0         
 [7] RSQLite_2.3.7           png_0.1-8               systemfonts_1.1.0      
[10] vctrs_0.6.5             stringr_1.5.1           pkgconfig_2.0.3        
[13] crayon_1.5.2            fastmap_1.2.0           magick_2.8.3           
[16] XVector_0.44.0          labeling_0.4.3          utf8_1.2.4             
[19] rmarkdown_2.27          UCSC.utils_1.0.0        tinytex_0.51           
[22] purrr_1.0.2             bit_4.0.5               xfun_0.44              
[25] zlibbioc_1.50.0         cachem_1.1.0            jsonlite_1.8.8         
[28] blob_1.2.4              highr_0.11              DelayedArray_0.30.1    
[31] parallel_4.4.0          R6_2.5.1                bslib_0.7.0            
[34] stringi_1.8.4           RColorBrewer_1.1-3      jquerylib_0.1.4        
[37] Rcpp_1.0.12             bookdown_0.39           tidyselect_1.2.1       
[40] splines_4.4.0           rstudioapi_0.16.0       abind_1.4-5            
[43] yaml_2.3.8              codetools_0.2-19        tibble_3.2.1           
[46] withr_3.0.0             KEGGREST_1.44.0         evaluate_0.23          
[49] AnnotationForge_1.46.0  survival_3.5-8          xml2_1.3.6             
[52] Biostrings_2.72.0       pillar_1.9.0            BiocManager_1.30.23    
[55] KernSmooth_2.23-22      generics_0.1.3          RCurl_1.98-1.14        
[58] rprojroot_2.0.4         munsell_0.5.1           scales_1.3.0           
[61] xtable_1.8-4            glue_1.7.0              tools_4.4.0            
[64] data.table_1.15.4       locfit_1.5-9.9          fs_1.6.4               
[67] fastmatch_1.1-4         cowplot_1.1.3           grid_4.4.0             
[70] colorspace_2.1-0        GenomeInfoDbData_1.2.12 cli_3.6.2              
[73] fansi_1.0.6             S4Arrays_1.4.1          viridisLite_0.4.2      
[76] svglite_2.1.3           Rgraphviz_2.48.0        gtable_0.3.5           
[79] sass_0.4.9              digest_0.6.35           SparseArray_1.4.8      
[82] farver_2.1.2            memoise_2.0.1           htmltools_0.5.8.1      
[85] lifecycle_1.0.4         httr_1.4.7              statmod_1.5.0          
[88] bit64_4.0.5            

References

Alamro, Hesham, Vladimir Bajic, Miroslav T Macvanin, Esma R Isenovic, Takashi Gojobori, Magbubah Essack, and Xin Gao. 2023. “Type 2 Diabetes Mellitus and Its Comorbidity, Alzheimer’s Disease: Identifying Critical microRNA Using Machine Learning.” Frontiers in Endocrinology 13: 1084656. https://doi.org/10.3389/fendo.2022.1084656.
Alexa, Adrian, Jörg Rahnenführer, and Thomas Lengauer. 2006. “Improved Scoring of Functional Groups from Gene Expression Data by Decorrelating GO Graph Structure.” Bioinformatics 22: 1600–1607. https://doi.org/10.1093/bioinformatics/btl140.
Bensellam, M, DR Laybutt, and JC Jonas. 2021. “Emerging Roles of Metallothioneins in Beta Cell Pathophysiology: Beyond and Above Metal Homeostasis and Antioxidant Response.” Biology (Basel) 10 (3): 176. https://doi.org/10.3390/biology10030176.
Campbell-Thompson, Martha, Elizabeth A. Butterworth, Jesse L. Boatwright, et al. 2021. “Islet Sympathetic Innervation and Islet Neuropathology in Patients with Type 1 Diabetes.” Scientific Reports 11 (1): 6562. https://doi.org/10.1038/s41598-021-85659-8.
Maglott, Donna, Jim Ostell, Kim D Pruitt, and Tatiana Tatusova. 2010. “Entrez Gene: Gene-Centered Information at NCBI.” Nucleic Acids Research 39 (suppl_1): D52–57. https://doi.org/10.1093/nar/gkq1237.
Segerstolpe, Åsa, Athanasia Palasantza, Petter Eliasson, et al. 2016. “Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes.” Cell Metabolism 24 (4): 593–607. https://doi.org/10.1016/j.cmet.2016.08.020.
Soares Bispo Santos Silva, D, J Antunes, K Balamurugan, G Duncan, C Sampaio Alho, and B McCord. 2015. “Evaluation of DNA Methylation Markers and Their Potential to Predict Human Aging.” Electrophoresis 36 (15): 1775–80. https://doi.org/10.1002/elps.201500137.
Steenoven, Inger van, Marleen JA Koel-Simmelink, Leonieke JM Vergouw, et al. 2020. “Identification of Novel Cerebrospinal Fluid Biomarker Candidates for Dementia with Lewy Bodies: A Proteomic Approach.” Molecular Neurodegeneration 15 (1): 36. https://doi.org/10.1186/s13024-020-00388-2.
Ziemann, Mark, Antony Kaspi, and Assam El-Osta. 2019. “Digital Expression Explorer 2: A Repository of Uniformly Processed RNA Sequencing Data.” GigaScience 8 (4): giz022. https://doi.org/10.1093/gigascience/giz022.